landsat spectral clustering

Spectral Clustering Example.

The image loaded here is a cropped portion of the MERCATOR_LC80210392016114LGN00_B10.TIF LANDSAT image included as a public datashader example.

In addition to dask-ml, we'll use rasterio to read the data and matplotlib to plot the figures. I'm just working on my laptop, so we could use either the threaded or distributed scheduler. I'll use the distributed scheduler for the diagnostics.

In [1]:
import rasterio
import holoviews as hv
from holoviews.operation.datashader import regrid
import dask.array as da
from dask_ml.cluster import SpectralClustering
from dask.distributed import Client
hv.extension('bokeh')
In [2]:
with rasterio.open('landsat-sample.tiff') as dataset:
    arr = dataset.read(1)

arr = arr.astype(float)
# Rescale for the clustering algorithm
arr = (arr - arr.mean()) / arr.std()
In [3]:
%%opts Image (cmap='viridis')
regrid(hv.Image(arr))
Out[3]:
In [4]:
client = Client(processes=False)
client
Out[4]:

Client

Cluster

  • Workers: 1
  • Cores: 8
  • Memory: 17.18 GB

We'll reshape the image to be how dask-ml / scikit-learn expect it: (n_samples, n_features) where n_features is 1 in this case. Then we'll persist that in memory. We still have a small dataset at this point. The large dataset, which dask helps us manage, is the intermediate n_samples x n_samples array that spectral clustering operates on. For our 2,500 x 2,500 pixel subset, that's ~50

In [5]:
X = da.from_array(arr.reshape(-1, 1), chunks=100_000)
X = client.persist(X)

And we'll fit the estimator.

In [6]:
clf = SpectralClustering(n_clusters=4, random_state=0,
                         kmeans_params={'init_max_iter': 5})
In [7]:
%time clf.fit(X)
INFO:dask_ml.cluster.spectral:k-means for assign_labels[starting]
INFO:dask_ml.cluster.k_means:Finished check_array in 41.41 s
INFO:dask_ml.cluster.k_means:Initializing with k-means||
INFO:dask_ml.cluster.k_means:init iteration  1/ 5 39.47 s,  2 centers
INFO:dask_ml.cluster.k_means:init iteration  2/ 5 40.82 s,  4 centers
INFO:dask_ml.cluster.k_means:init iteration  3/ 5 40.74 s,  7 centers
INFO:dask_ml.cluster.k_means:init iteration  4/ 5 40.44 s,  8 centers
INFO:dask_ml.cluster.k_means:init iteration  5/ 5 41.46 s,  8 centers
INFO:dask_ml.cluster.k_means:Finished initialization. 483.94 s,  4 centers
INFO:dask_ml.cluster.k_means:Lloyd loop  0. Shift: 0.0981 [46.35 s]
INFO:dask_ml.cluster.k_means:Lloyd loop  1. Shift: 0.0026 [45.65 s]
INFO:dask_ml.cluster.k_means:Lloyd loop  2. Shift: 0.0000 [45.18 s]
INFO:dask_ml.cluster.spectral:k-means for assign_labels[finished]
CPU times: user 24min 30s, sys: 36min 31s, total: 1h 1min 2s
Wall time: 11min 49s
Out[7]:
SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
          eigen_solver=None, eigen_tol=0.0, gamma=1.0, kernel_params=None,
          kmeans_params={'init_max_iter': 5}, n_clusters=4,
          n_components=100, n_init=10, n_jobs=1, n_neighbors=10,
          persist_embedding=False, random_state=0)
In [8]:
labels = clf.assign_labels_.labels_.compute()
c = labels.reshape(arr.shape)
In [9]:
%%opts Image (cmap='viridis')
regrid(hv.Image(arr)).relabel('Image') + regrid(hv.Image(c)).relabel('Clustered')
Out[9]:

Right click to download this notebook from GitHub.